Introduction

In this session, we will have a glimpse of the broad range of tools available to analyse community time series. You will find in Buckley, Day, Case, et al. (2021) a more extensive and complementary introduction to the analyses that we will cover in the session.

Ecological communities are assemblages of species bound together by a shared environment and a network of influence each species has on the others. Ecological communities can also refer to a set of species from a particular biological group, not necessarily interacting, and living in a particular geographical region. Here we will see some temporal analyses that can be applied to communities defined by this second meaning, although many advances have been also done to introduce the time perspective in other areas of community ecology (e.g. in interacting networks (Yang 2020)).

One of the main areas of research in communities is their variation in structure: in the number of species it presents (i.e. richness), in the identity and abundance of species (i.e. composition), and their distribution (i.e. evenness). Here we will mainly focus on temporal changes in community composition, which will necessarily involve a population time series for every species. To deal with this multidimensionality, most analyses will first try to summarise the compositional changes between samples in a unique and univariant descriptor (e.g. a diversity or dissimilarity index), or place the sampled communities in a reduced dimensional space (e.g. a principal coordinate analysis). There are also methods to assess differences in communities directly with the multivariate time series (e.g. a multivariate analysis of variance). Some of the multivariate techniques that we will see in this session might be specially indicated for community data, but might be applicable, with few changes, to other multivariate data sets too.

The data set (Ellison 2021) that we will analyse in this session contains the abundance of the different ground- and soil-nesting ant species found in a hemlock forest of North America between 2003 and 2018 (Record et al. 2018). Previous studies have pointed out the foundation role the hemlock has in this ecosystem by strongly influencing the environment and the community it hosts. “Hemlock-dominated stands are dark, cool, and moist; have acidic, nutrient-poor soils with slow rates of nutrient cycling; and are populated by generally species-poor assemblages of associated plants and animals” (Record et al. 2018). In 2005, the researchers decided to log and remove all hemlock individuals >20 cm diameter from an experimental stand and analyse the effects that the removal of a foundation species would have on the local ant community.

*The Harvard Forest Hemlock Removal Experiment: https://harvardforest.fas.harvard.edu/other-tags/ants
*The Harvard Forest Hemlock Removal Experiment: https://harvardforest.fas.harvard.edu/other-tags/ants

Ok, so let’s start

Load the libraries that we are going to use.

library(tidyverse) # manage data
library(DT) # visualise tables in html
library(viridis) # use color-blind friendly palettes
library(ggrepel) # avoid overlapping texts in the figures
library(codyn) # calculate community dynamics metrics
library(vegan) # calculate community ecology metrics
library(eulerr) # calculate Venn and Euler diagrams

Define your ggplot theme

theme_set(theme_minimal()+
            theme(axis.title = element_text(size = 15),
                  axis.line = element_line(color = "black",
                                           linewidth = 0.5),
                  panel.grid.major = element_blank(),
                  axis.text = element_text(color = "black", size = 12),
                  strip.text.x = element_text(size = 12),
                  axis.ticks = element_line(color = "black")))

Load the data set ants_logged.csv,

ants_raw <- read.csv("ants_logged.csv")

and have a first look at the data

glimpse(ants_raw)
## Rows: 534
## Columns: 13
## $ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ year       <int> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,…
## $ date       <chr> "2003-06-01", "2003-06-01", "2003-06-01", "2003-06-01", "20…
## $ block      <chr> "Ridge", "Ridge", "Ridge", "Ridge", "Ridge", "Ridge", "Ridg…
## $ plot       <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ treatment  <chr> "Logged", "Logged", "Logged", "Logged", "Logged", "Logged",…
## $ moose.cage <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ trap.type  <chr> "bait", "bait", "bait", "bait", "hand", "hand", "hand", "ha…
## $ trap.num   <chr> "1 hour", "1 hour", "1 hour", "15 minutes", "", "", "", "",…
## $ genus      <chr> "Aphaenogaster", "Aphaenogaster", "Temnothorax", "Aphaenoga…
## $ species    <chr> "fulva", "picea", "longispinosus", "picea", "picea", "penns…
## $ code       <chr> "aphful", "aphpic", "temlon", "aphpic", "aphpic", "campen",…
## $ abundance  <int> 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1,…
ants_raw %>% 
  distinct(block, plot, treatment) # extract the different combinations of block, plot and treatment
ants_raw %>% 
  summarise(n_species = n_distinct(code), # number of different codes (i.e. species)
            n_years = n_distinct(year)) # number of different years
ants_raw %>% 
  distinct(year, date) %>% # extract the different combinations of year and date
  DT::datatable() # a function to improve the visualization of the table in the html output

As you can see, each entry contains the abundance of one of the sampled species during a particular sampling date (i.e. long format). The data set contains the abundances of the 30 different ants detected at the logged hemlock stand in 14 years. Sampling was conducted in summer at least once a year.

Are there any missing values? How are they distributed?

anyNA(ants_raw) # is there any missing value? (Boolean answer)
## [1] TRUE
# how are NAs distributed? And what might NAs exactly mean, in this data set?
# let's see it by year, for example
ants_raw %>% 
  mutate(cell_content = case_when(is.na(abundance) ~ "NA",
                                  abundance == 0 ~ "0",
                                  T ~ ">0")) %>%
  # we create a new variable capturing cell content
  # as we are interested in defining 3 different situations, we use the case_when function
  ggplot(aes(x = year)) +
  geom_histogram(aes(fill = cell_content),
                 binwidth = .5, center = 0) +
  scale_fill_discrete(drop = F) +
  labs(y = "Number of rows")

ants_raw %>% 
  filter(year == 2007) %>% 
  DT::datatable()
# and let's see it by year x species too
ants_raw %>% 
  mutate(cell_content = case_when(is.na(abundance) ~ "NA",
                           abundance == 0 ~ "0",
                           T ~ ">0")) %>% 
  ggplot(aes(x = year, y = code, fill = cell_content)) +
  geom_tile(color = "white") +
  # scale_x_continuous(breaks = seq(from = 1970, to = 2015, by = 5)) +
  labs(y = "species code") +
  theme(axis.text.y = element_text(size = 8, face = "italic"))

The abundance column is composed by natural numbers >0 and missing values, but it does not contain any zero. In addition, NAs are clustered in 2007 and are only specified for 5 different species. This pattern suggests that in 2007 the researchers conducted the sampling and found, but didn’t count, 5 different species in the traps, which is different from not having done the sampling (e.g. 2016 and 2017) or finding no ant during the sampling (the actual 0)

Data cleaning and wrangling

Now that we have had a glimpse of the data set, let’s structure it to fit our needs.

# Adapt the longer data set to a yearly resolution (not date)
ants_long <- ants_raw %>%
  select(year, genus, species, code, abundance) %>% # the variables we will work with
  group_by(year, genus, species, code) %>% 
  summarise(total_abundance = sum(abundance)) %>% # calculate the total abundance per year and species
  ungroup()

Gentle reminder

Many community analyses work with a matrix-like database containing only the abundances of each species (in the columns) at each sampling time (in the rows). How would you obtain it from the ants_long data set?


Explore first

Data exploration is a very important and determinant first step in any analysis and can take up a big proportion of the time spent on it (Zuur, Ieno, and Elphick 2010). It is the moment to get familiar with the data set to select an appropriate analysis, detect potential errors and future pitfalls, and have more clues when interpreting results (Buckley, Day, Case, et al. 2021).

How many species and samples (years) do we have in the final data set?

dim(ants_wide)
## [1] 14 31

14 different years (rows), and 31 different species (columns). This is strange, because when we had a first look at the data we saw that we were dealing with a pool of 30 species (30 distinct species codes).

ants_long %>% 
  summarise(n_code = n_distinct(code), # count the number of different species codes
            n_sp = n_distinct(paste(genus, species))) # count the number of different species names
ants_long %>% 
  group_by(code) %>% 
  mutate(n_sp = n_distinct(paste(genus, species))) %>% 
  filter(n_sp > 1) %>% # keep those rows with more than one species name per code
  distinct(genus, species, code)

There was a typo in the species name of Temnothorax longispinosus, that was creating a “new” species-column in the ants_wide data set. Let’s correct it and redo the two data sets.

ants_long <- ants_long %>% 
  mutate(species = if_else(species == "lognispinosus", # if the specific epithet in that row is wrong
                           "longispinosus", # change it by the correct one
                           species)) # but conserve the specific epithet of the other rows

ants_wide <- ants_long %>%
  pivot_wider(id_cols = year,
              names_from = c(genus, species),
              values_from = total_abundance,
              values_fill = list(total_abundance = 0),
              names_sep = " ") %>% 
  column_to_rownames(var = "year")

dim(ants_wide)
## [1] 14 30

A first example on the importance of paying attention to the first emerging inputs during the data exploration.

How do the number of species change across years?

ants_long %>% 
  group_by(year) %>% 
  summarise(n_species = n_distinct(code)) %>% # count the number of different species by year
  ggplot(aes(x = year, y = n_species)) +
  geom_line() +
  geom_point(size = 3) +
  geom_smooth(se = F) +
  #making the plot fancier:
  geom_text(aes(label = n_species), nudge_y = 1) + # adding the number of species as a label
  geom_vline(aes(xintercept = 2005, color = "logging"), linetype = "dashed") + # indicate logging year
  guides(color = guide_legend(title = NULL)) +
  scale_x_continuous(breaks = function(x) seq(ceiling(x[1]), floor(x[2]), by = 2)) +
  # a sequence taking the limitis of the axis
  labs(y = "richness")

This figure already shows us that there were local extinction and colonisation events in the community after the logging event. But we are only looking at richness (the number of different species), not composition (which species). We know we have detected 30 different species across all samples, but the maximum richness detected in any sample is 17. Thus, there must have been some changes in the composition (i.e. turnover). We will check it in few minutes!

How does total ant abundance change across years?

(i.e. row summaries in the ants_wide data set)

ants_long %>% 
  group_by(year) %>% 
  summarise(total_abundance = sum(total_abundance)) %>% 
  ggplot(aes(x = year, y = total_abundance)) +
  geom_line() + geom_point(size = 3) +
  geom_smooth(se = F) +
  geom_vline(aes(xintercept = 2005, color = "logging"), linetype = "dashed") + # indicate logging year
  guides(color = guide_legend(title = NULL)) +
  scale_x_continuous(breaks = function(x) seq(ceiling(x[1]), floor(x[2]), by = 2)) +
  labs(y = "Number of ants\nacross all species")

We are finding quite cyclical patterns, right? With a decrease in richness and total abundance after the logging, followed by an increase until 2013-2015 and a final decrease.

So it seems that the community is changing both in terms of which species contains and how many individuals are there.

How does total ant abundance change among species

(i.e. column summaries in the ants_wide data set)?

Let’s have a look to the rank abundance distribution (RAD) of the community: a plot with the absolute or relative abundances of the species ranked from the most to the least abundant ones.

ants_long %>%
  group_by(genus, species) %>% 
  summarise(total_abundance = sum(total_abundance,
                                  na.rm = T)) %>% 
  ggplot(aes(x = reorder(paste(genus, species),
                         -total_abundance), # x axis are complete species names ordered by their abundance
             y = total_abundance)) + 
  geom_col(color = "black", fill = "grey") +
  labs(y = "Number of ants\nacross all years") +
  theme(axis.text.x = element_text(face = "italic",
                                   angle = 90,
                                   vjust = .25,
                                   hjust = 1),
        axis.title.x = element_blank()) 

Most species are rare across the sampling period; only 3-4 species dominate the community.

Is this RAD maintained across years?

# let's have first the order we obtained in the overall RAD
ordered_ab <- ants_long %>%
  group_by(code) %>% 
  summarise(total_abundance = sum(total_abundance, na.rm = T)) %>% 
  arrange(desc(total_abundance))

# and apply this overall order to yearly RADs
ants_long %>% 
  mutate(code = factor(code,levels = ordered_ab$code)) %>%
  split(.$year) %>% 
  #map2(names(.),
  map(
    ~ ggplot(data = .x, aes(x = code, y = total_abundance)) + 
  geom_col(color = "black", fill = "grey") +
  scale_x_discrete(drop = F) +
  ylim(0,130) +
  theme(axis.text.x = element_text(face = "italic", angle = 90, vjust = .25, hjust = 1,
                                   size = 8),
        strip.text = element_text(size = ),
        axis.title.x = element_blank(),
        panel.border = element_rect(fill = NA))
  )
  # + labs(title = .y)) 

Rank order among species changes across years, suggesting rather asynchronous fluctuations in abundance.

As we have seen in the exploration, the community composition is changing in the identities and abundances of the species it contains. These first visualisations are mainly descriptive and quite limited when we want to summarise and capture the more complex and multivariant patterns we are detecting of the community. So let’s dig a bit more into this.

Incidence-based analyses

We will start by analysing the temporal changes in the identities of the species constituting the community.

Turnover analysis: did the taxonomic composition of the community change over time?

The rate of change in the taxonomic composition of a community over time can be quantified with the turnover rate (i.e. the ‘temporal turnover’ or ‘species turnover’). This method can measure when, and how rapidly, community shifts occurred in a temporal sequence and detect complex temporal patterns such as perturbations, tipping points and pivot points. One of the most commonly used turnover rates is calculated as T = (E + C)/R, where T is the percentage turnover rate, E is the number of taxa that went extinct between two time points, C is the number of taxa that colonised the community between the same two time points, and R is the total number of species (i.e. the richness of the pool of species conformed by the two samples). Therefore, turnover rate represents the proportional change in taxon identities over the period and ranges from 0 (no turnover) to 1 (complete turnover). This metric can be calculated only for a single location at two time points (Buckley, Day, Case, et al. 2021).

To do so, we will use the turnover function from the codyn package (from community dynamics). With a given data frame (in the long format), it calculates the rate of appearances, disappearances and total turnover of the community between consecutive samples.

ants_turnover <- ants_long %>% 
  mutate(total_abundance = 1) %>% # to include 2007 in the turnover analysis
  codyn::turnover(time.var = "year",
                  species.var = "code",
                  abundance.var = "total_abundance")


ants_turnover %>% 
  ggplot(aes(x = year, y = total)) +
  geom_point() + geom_line() +
  geom_vline(aes(xintercept = 2005), linetype = "dashed", color = "black") +
  scale_color_viridis_d(direction = -1) +
  scale_x_continuous(breaks = function(x) seq(ceiling(x[1]), floor(x[2]), by = 1)) + 
  labs(title = "Community turnover relative to the preceding sample",
       x = "Year",
       y = "Turnover") +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust = 1))

The annual turnover rate of the community increased and remained quite high after the experimental logging in 2005 (dashed line), with communities between consecutive years sharing less than 50% of species on most occasions (except between 2010 and 2012).

ants_cols <- ants_long %>% 
  mutate(total_abundance = 1) %>% 
  turnover(time.var = "year",
           species.var = "code",
           abundance.var = "total_abundance",
           metric = "appearance")

ants_exts <- ants_long %>% 
  mutate(total_abundance = 1) %>% 
  turnover(time.var = "year",
           species.var = "code",
           abundance.var = "total_abundance",
           metric = "disappearance")


ants_cols %>% 
  left_join(ants_exts) %>% 
  pivot_longer(cols = -year,
               names_to = "metric",
               values_to = "rate") %>%
  ggplot(aes(x = year, y = rate)) +
  geom_col(aes(fill = metric)) +
  geom_point(data = ants_turnover, aes(y = total)) +
  geom_line(data = ants_turnover, aes(y = total)) +
  geom_vline(aes(xintercept = 2005), linetype = "dashed", color = "black") +
  scale_fill_viridis_d(begin = .5) +
  scale_x_continuous(breaks = function(x) seq(ceiling(x[1]), floor(x[2]), by = 1)) + 
  ylim(0,1) + 
  labs(title = "Community turnover relative to the preceding sample",
       x = "Year",
       y = "Turnover") +
  theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust = 1))

Visualising colonisations and extinctions rates gives us more information on the underlying process of these quite high turnover rates. Turnover of the community was caused by both gains and losses in species. Logging was followed by a period of local extinctions (extinctions > colonisations, reducing community richness), followed by one of richness recovery (colonisations > extinctions).


Challenge

The turnover analysis has shown us that local colonisations and extinctions were a constant in the ant community. However, the function turnover from the codyn R package calculates the turnover rates between consecutive samples, thus we cannot know whether the increase in appearance rates was driven by the gain of completely new species or by the gain of species that had already been in the initial community but went temporally extinct (i.e. recolonisations). Can you think of a way to guess it?


Euler diagrams, overlap analysis and time core species

Euler and Venn diagrams are a very good way to visualise the turnover metrics we have just seen. They display two overlapping circles representing the two sampled communities. In an Euler diagram, the circles and overlap areas are proportional to the number of species, while in a Venn diagram all circles and overlap areas are the same size.

There are different r packages to draw Euler diagrams. In this session I’ll use eulerr https://cran.r-project.org/web/packages/eulerr/vignettes/under-the-hood.html

ants_long %>% 
  filter(year %in% c(2004, 2008)) %>% # keep the years of interest
  split(.$year) %>% # create a list with a dataframe per year
  map(select, code) %>% # for each dataframe, select the code variable
  map(unlist) %>% # create a vector from the selected variable
  eulerr::euler() %>% # calculate the parameters for the euler diagram
  plot(quantities = T)  # and plot it

ants_long %>% 
  filter(year %in% c(2004, 2008)) %>% # keep the years of interest
  split(.$year) %>% # create a list with a dataframe per year
  map(select, code) %>% # for each dataframe, select the code variable
  map(unlist) %>% # create a vector from the selected variable
  eulerr::venn() %>% # calculate the parameters for the venn diagram
  plot(quantities = T)  # and plot it

ants_long %>% 
  filter(year %in% c(2004, 2013)) %>% # keep the years of interest
  split(.$year) %>% # create a list with a dataframe per year
  map(select, code) %>% # for each dataframe, select the code variable
  map(unlist) %>% # create a vector from the selected variable
  eulerr::euler() %>% # calculate the parameters for the euler diagram
  plot(quantities = T)  # and plot it

The appearance rates we calculated with the turnover function are the ratio between the number in the dark grey area and the total number of species; the disappearance rates, the ratio between the number in the white area and the total number of species; and the total turnover, the sum of the previous two (or 1 - the ratio of the shared species).

These diagrams confirm our previous interpretations: after the logging, the community initially lost 70% of their previous pool. Then, in 2013, it regained the previously lost species and gained new ones. But in 2018, the community only conserved two species from the initial pool and had lost 6 species from the new colonizations.


Task

  1. Euler and Venn diagrams can simultaneously plot more than two samples. Plot all the samples you want and find how many species are shared.

  2. Now we have an idea about which processes drove the changes in the community composition (local extinctions, recolonisations and new gains), but which species were involved in those changes? Overlap analyses return a list of taxa that are shared between two samples. Create an overlap list of the first and last samples and identify which are their overlapping species.

  3. The list of taxa that occur consistently through time for a set of samples constitutes the time core taxa. Are the overlapping species identified in the previous exercise time core species of the community?

*Tom Murray, CC-BY-ND-NC
*Tom Murray, CC-BY-ND-NC

Abundance-based analyses

Dissimilarity measures

Pairwise dissimilarity methods are the most common way biologists quantify the difference in the composition of taxa occurring in two samples (Buckley, Day, Lear, et al. 2021). The result is a single number that measures the distance in compositional space between the two samples. Dissimilarity can also be represented as similarity, that is, 1—dissimilarity. When applied to multiple samples, these methods result in a matrix of all possible comparisons (a triangular dissimilarity matrix).

The turnover rate as calculated in the codyn package is in fact a dissimilarity metric usually known as the Jaccard index. As we saw, this index is incidence-based (only takes into account presence/absence data), but there is also an abundance-based version of the index, also called Ružička index.

ants_wide %>%   # vegdist acts on a community data matrix (i.e. no value other than abundance or presence/absence)
  vegan::vegdist(method = "jaccard", binary = T) # Binary = T: incidence-based calculation
## Error in vegan::vegdist(., method = "jaccard", binary = T): missing values are not allowed with argument 'na.rm = FALSE'

Many distance measures cannot be computed if there are empty samples or missing values. Deleting entire rows or estimating missing data may be an option in some cases. However, other measures, such as the Gower coefficient, can be calculated while ignoring comparisons among samples with missing data

ants_wide %>%  
  na.omit() %>% 
  vegan::vegdist(method = "jaccard", binary = T) %>% # Binary = T: incidence-based calculation
  # we obtain a distance object (i.e. the lower triangle of the distance matrix stored by columns in a vector)
  as.matrix() %>% # from a distance object to a matrix object
  as.data.frame() %>% # from matrix to data.frame
  rownames_to_column(var = "year1") %>% 
  pivot_longer(cols = -year1,
               names_to = "year2",
               values_to = "distance") %>% # we pivot the object for visualisation purposes
  mutate(across(.cols = c(year1, year2), .fns = as.numeric)) %>% 
  ggplot(aes(x = year1, y = year2)) +
  geom_tile(aes(fill = distance), color = "white") +
  geom_text(aes(label = round(distance, 2)), size = 2) + 
  scale_fill_viridis_c(option = "B", limits = c(0, 1)) +
  labs(title = "Jaccard dissimilarity index (incidence-based)",
       subtitle = "i.e. turnover rate") +
  theme(axis.text.x = element_text(angle = 30, hjust = .5, vjust = .5),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

ants_wide %>% 
  na.omit() %>% 
  vegan::vegdist(method = "jaccard") %>% # default is binary = F: abundance-based analysis
  as.matrix() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "year1") %>% 
  pivot_longer(cols = -year1,
               names_to = "year2",
               values_to = "distance") %>%
  mutate(across(.cols = c(year1, year2), .fns = as.numeric)) %>% 
  ggplot(aes(x = year1, y = year2)) +
  geom_tile(aes(fill = distance), color = "white") +
  geom_text(aes(label = round(distance, 2)), size = 2) + 
  scale_fill_viridis_c(option = "B", limits = c(0, 1)) +
  labs(title = "Jaccard dissimilarity index (abundance-based)",
       subtitle = "i.e. Ružička index") +
  theme(axis.text.x = element_text(angle = 30, hjust = .5, vjust = .5),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

With the incidence-based matrix, we can check that the Jaccard dissimilarity index corresponds to the turnover rate calculated with the turnover function (look at the values from the 2004 row or column, or the inferior or superior subdiagonal). We can also see low distances (high similarities) between the 2010, 2011 and 2012 communities, a period we had described as more stable because of the low turnover values it presented.

When abundance is considered, the patterns of dissimilarity notably change. Taking into account only the identity of the taxa, the most dissimilar year is 2006 (i.e. after the logging) and temoorally closer years tend to be more similar (darker colours are close to the diagonal). Yet, when abundance is considered, the most dissimilar year is 2009 (the year with the lowest total abundance, as we saw in the exploration) and darker colours (higher similarities) tend to be at the edges of the heat map (temporally distant years are similar).

Bonus material: how do I find an appropriate dissimilarity measure?

Since the description of the first floristic similarity coefficient by Paul Jaccard in 1900, community ecologists have developed a broad array of similarity and dissimilarity coefficients. There is no single coefficient that is appropriate for all occasions. The choice should be guided by the properties of coefficients and the objective of the research (Legendre and De Cáceres 2013).

The Euclidian distance and the double-zero problem


Task

When measuring the distance between two points, our geometrically-trained brain tends to calculate it following the Pythagorean theorem (i.e. the Euclidian distance). Use the vegdist function from the vegan package to calculate the Euclidian distance for all sample combinations of the community. What do you think? Do you find this metric appropriate for analysing community dissimilarity? Why?


Appropriate dissimilarity indices for community analyses

Another aspect to consider when choosing a dissimilarity measure is how it treats abundant and rare species. In indices like Bray-Curtis or Kulczynski equal differences in abundant and rare species contribute equally to the distance between sites, while in other indices like Canberra or Clark, differences in rare species contribute more to distance. This could be especially relevant when we consider that rare species are indicative of special environmental conditions.

ants_wide %>% 
  na.omit() %>% 
  vegan::vegdist(method = "bray") %>% 
  as.matrix() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "year1") %>% 
  pivot_longer(cols = -year1,
               names_to = "year2",
               values_to = "distance") %>% 
  mutate(across(.cols = c(year1, year2), .fns = as.numeric)) %>% 
  ggplot(aes(x = year1, y = year2)) +
  geom_tile(aes(fill = distance), color = "white") +
  scale_fill_viridis_c(option = "B", limits = c(0, 1)) +
  labs(title = "Bray-Curtis dissimilarity index",
       subtitle = "Abundant and rare species contribute equally") +
  theme(axis.text.x = element_text(angle = 30, hjust = .5, vjust = .5),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

ants_wide %>% 
  na.omit() %>% 
  vegan::vegdist(method = "canberra") %>% 
  as.matrix() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "year1") %>% 
  pivot_longer(cols = -year1,
               names_to = "year2",
               values_to = "distance") %>% 
  mutate(across(.cols = c(year1, year2), .fns = as.numeric)) %>% 
  ggplot(aes(x = year1, y = year2)) +
  geom_tile(aes(fill = distance), color = "white") +
  scale_fill_viridis_c(option = "B", limits = c(0, 1)) +
  labs(title = "Canberra dissimilarity index",
       subtitle = "Rare species contribute more to distance") +
  theme(axis.text.x = element_text(angle = 30, hjust = .5, vjust = .5),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

When rare species are given more weight, distance values increase. This is rather expectable, as this community presents a high proportion of rare species and high turnover rates.

Ordination

One of the most common applications of distance coefficients is ordination analyses, where we try to place in a space of reduced dimensions a series of points such that more similar points are closer and more dissimilar points are farther.

Because ordination methods implicitly (PCA, CA) or explicitly (PCoA) use a distance function as their metric to position objects relative to one another in ordination space, it is important to make sure that the chosen distance coefficient is meaningful for the objects under study, especially when dealing with community composition data. By choosing an appropriate distance measure, an ecologist tries to appropriately model the relationships among the samples for the data at hand. The choice of a similarity or distance measure is an ecological, not a statistical decision (Legendre and Legendre 2012).

Principal coordinates analysis (PCoA), a.k.a. Classical Multidimensional Scaling

Principal component analysis (PCA) is only applicable to data for which the Euclidean distance is appropriate, whereas correspondence analysis (CA) is only applicable to frequency-like data for which the Chi-squared distance is appropriate. For other types of data, the principal coordinate analysis (abbreviated PCoA) represents, in an Euclidean space (i.e. a Cartesian coordinate system), a set of objects whose relationships are measured by any distance coefficient chosen by users.

(pcoa_hell <- ants_wide %>% 
  na.omit() %>% 
  vegan::vegdist(method = "hellinger") %>%
  # this is a common metric in PCoA (it gives the same weight to all species when the total number of individuals is the same between samples)
  cmdscale(eig = T))
## $points
##               [,1]        [,2]
## 2003 -0.2856888552 -0.22368591
## 2004 -0.5904191417  0.07322758
## 2005  0.0090061419 -0.28474779
## 2006  0.0339015785 -0.35844625
## 2008  0.1438023493 -0.23383484
## 2009  0.2711902477 -0.17579320
## 2010  0.1990437517 -0.04789914
## 2011  0.4494105671  0.18229013
## 2012  0.2299152322  0.22318800
## 2013  0.0008254706  0.39546085
## 2014  0.2804121468  0.15839024
## 2015 -0.2644323570  0.21692818
## 2018 -0.4769671320  0.07492214
## 
## $eig
##  [1]  1.196170e+00  6.700295e-01  4.381824e-01  2.460571e-01  1.409137e-01
##  [6]  1.152012e-01  8.963040e-02  6.568090e-02  4.062883e-02  3.169604e-02
## [11]  1.750869e-02  1.179310e-02 -5.306019e-17
## 
## $x
## NULL
## 
## $ac
## [1] 0
## 
## $GOF
## [1] 0.6091739 0.6091739
# The variation explained by each axis is its eigenvalue divided by the sum of the positive eigenvalues.
var_exp <- pcoa_hell$eig[1:2]/sum(pcoa_hell$eig[pcoa_hell$eig > 0])

pcoa_years <- pcoa_hell$points %>% 
  as.data.frame()
  
pcoa_species <- cor(na.omit(ants_wide), pcoa_years) %>% 
  as.data.frame()

# Visualisation in the ordination diagram
ggplot() +
  geom_segment(data = pcoa_species %>% 
              rownames_to_column(var = "sp"),
            aes(x = 0, y = 0, xend = V1, yend = V2),
            color = "grey",
            arrow = arrow()) +
  geom_text_repel(data = pcoa_species %>% 
              rownames_to_column(var = "sp"),
            aes(x = V1, y = V2, label = sp),
            color = "grey",
            max.overlaps = 30) +
  geom_point(data = pcoa_years %>% 
               rownames_to_column(var = "year"),
             aes(x = V1, y = V2),
             size = 2) +
  geom_text_repel(data = pcoa_years %>% 
               rownames_to_column(var = "year"),
             aes(x = V1, y = V2, label = year),
             max.overlaps = 13) +
  geom_hline(aes(yintercept = 0), color = "red", linetype = "dashed") +
  geom_vline(aes(xintercept = 0), color = "red", linetype = "dashed") +
  labs(title = "PCoA using Hellinger distance",
       subtitle = paste0("Variance explained = ",
                         round(sum(var_exp)*100), "%"),
       x = paste0("PCo1 (", round(var_exp[1]*100), "% var)"),
       y = paste0("PCo2 (", round(var_exp[2]*100), "% var)"))

The two main axes of the PCoA explain more than 60% of the variance of composition data, which is fairly good. Positive PCo1 values are associated with the increase in abundance of some Camponotus species, Formica dolosa and Myrmica nearctica, and the decrease in some Lasius species, Aphaenogaster fulva and Temnothorax longispinosus. For the PCo2, positive values are associated with the increase in abundance of some Formica and Camponotus species, among others, and the decrease in Camponotus herculeanus. Notably, Aphaenogaster picea, a common species in all samples, doesn’t contribute much to explain changes in community composition.

Points that are close in the ordination diagram are similar in terms of community composition (e.g. 2005, 2006 and 2008), while distant points are more dissimilar (e.g. 2004 and 2011).

Trajectory analyses

The path that connects consecutive sampling years in the ordination diagram represents the trajectory of the community. The analysis of their geometric properties or their comparison can give many insights into community dynamics. You can find an extensive presentation of trajectory analysis in community ecology in (De Cáceres et al. 2019).

ggplot() +
  geom_segment(data = pcoa_species %>% 
              rownames_to_column(var = "sp"),
            aes(x = 0, y = 0, xend = V1, yend = V2),
            color = "grey",
            arrow = arrow()) +
  geom_text_repel(data = pcoa_species %>% 
              rownames_to_column(var = "sp"),
            aes(x = V1, y = V2, label = sp),
            color = "grey",
            max.overlaps = 30) +
  geom_point(data = pcoa_years %>% 
               rownames_to_column(var = "year"),
             aes(x = V1, y = V2),
             size = 2) +
  geom_text_repel(data = pcoa_years %>% 
               rownames_to_column(var = "year"),
             aes(x = V1, y = V2, label = year),
             max.overlaps = 13) +
  geom_path(data = pcoa_years %>% 
               rownames_to_column(var = "year"),
             aes(x = V1, y = V2)) +
  geom_hline(aes(yintercept = 0), color = "red", linetype = "dashed") +
  geom_vline(aes(xintercept = 0), color = "red", linetype = "dashed") +
  labs(title = "PCoA using Hellinger distance",
       subtitle = paste0("Variance explained = ", round(sum(var_exp)*100), "%"),
       x = paste0(round(var_exp[1]*100), "% var"),
       y = paste0(round(var_exp[2]*100), "% var"))

The trajectory seems to show a quite circular pattern, in the anti-clockwise direction. Thus, more distant communities would be those separated by half of the total sampling period. This can be more easily analysed with a time-lag analysis.

Time-lag analysis

A time-lag analysis involves relating the amount of change in composition (calculated with a dissimilarity index) to the amount of change in time across increasing temporal distances, called ‘lags’.

hell_vector <- ants_wide %>% 
  na.omit() %>% 
  vegdist(method = "hellinger") %>% 
  as.numeric()

lag_vector <- ants_wide %>% 
  na.omit() %>% 
  rownames() %>% 
  as.numeric() %>% 
  vegdist(method = "euclidean") %>% 
  as.numeric()

lag_df <- data.frame(hell_dist = hell_vector,
                     time_lag = lag_vector)

linear <- lm(hell_dist ~ time_lag, data = lag_df)
summary(linear)
## 
## Call:
## lm(formula = hell_dist ~ time_lag, data = lag_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34510 -0.12842 -0.01021  0.09568  0.37174 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.585038   0.033491  17.469  < 2e-16 ***
## time_lag    0.019968   0.005192   3.846 0.000248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1593 on 76 degrees of freedom
## Multiple R-squared:  0.1629, Adjusted R-squared:  0.1519 
## F-statistic: 14.79 on 1 and 76 DF,  p-value: 0.0002477
lag_df %>% 
  ggplot(aes(x = time_lag, y = hell_dist)) +
  geom_point(alpha = .5, size = 4) +
  geom_smooth(method = "lm", se = F, size = 1) + # fit a regression line
  scale_x_continuous(breaks = seq(2, 12, by = 2)) +
  labs(x = "Time lag (years)",
       y = "Hellinger distance")

As we can see in the linear model, community composition tends to diverge with time (i.e. more distant time points present more dissimilar communities than consecutive or close time points).


Task

In the plot we can see that in higher time lags, the Hellinger distance is lower than expected from a linear fit. Can you find a better fit? How do you interpret the resulting pattern?

References

Buckley, Hannah L., Nicola J. Day, Bradley S. Case, and Gavin Lear. 2021. “Measuring Change in Biological Communities: Multivariate Analysis Approaches for Temporal Datasets with Low Sample Size.” PeerJ 9 (April): e11096. https://doi.org/10.7717/peerj.11096.
Buckley, Hannah L., Nicola J. Day, Gavin Lear, and Bradley S. Case. 2021. “Changes in the Analysis of Temporal Community Dynamics Data: A 29-Year Literature Review.” PeerJ 9 (April): e11250. https://doi.org/10.7717/peerj.11250.
De Cáceres, Miquel, Lluís Coll, Pierre Legendre, Robert B. Allen, Susan K. Wiser, Marie Josée Fortin, Richard Condit, and Stephen Hubbell. 2019. “Trajectory Analysis in Community Ecology.” Ecological Monographs 0 (0): 1–20. https://doi.org/10.1002/ecm.1350.
Ellison, Aaron. 2021. “Ant Assemblages in Hemlock Removal Experiment at Harvard Forest Since 2003.” Environmental Data Initiative. https://doi.org/10.6073/PASTA/34C6A35E080860CE34A2F35800291280.
Legendre, Pierre, and Miquel De Cáceres. 2013. “Beta Diversity as the Variance of Community Data: Dissimilarity Coefficients and Partitioning.” Ecology Letters 16 (8): 951–63. https://doi.org/10.1111/ele.12141.
Legendre, Pierre, and Louis Legendre. 2012. “Chapter 7 – Ecological Resemblance.” Developments in Environmental Modelling 24: 265–335.
Record, Sydne, Tempest McCabe, Benjamin Baiser, and Aaron M. Ellison. 2018. “Identifying Foundation Species in North American Forests Using Long-Term Data on Ant Assemblage Structure.” Ecosphere 9 (3): e02139. https://doi.org/10.1002/ecs2.2139.
Yang, Louie H. 2020. “Toward a More Temporally Explicit Framework for Community Ecology.” Ecological Research 35 (3): 445–62. https://doi.org/10.1111/1440-1703.12099.
Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14. https://doi.org/10.1111/j.2041-210X.2009.00001.x.

  1. Universitat de Barcelona, @PolCapdevila90↩︎

  2. CREAF, @mariavivesingla↩︎